import numpy as np
import pandas as pd
from scipy import stats
from matplotlib import pyplot as plt
import seaborn as sns
!pip install plotly
WARNING: Ignoring invalid distribution -andas (/Users/bharat/opt/anaconda3/lib/python3.9/site-packages) WARNING: Ignoring invalid distribution -andas (/Users/bharat/opt/anaconda3/lib/python3.9/site-packages) Collecting plotly Downloading plotly-5.11.0-py2.py3-none-any.whl (15.3 MB) |████████████████████████████████| 15.3 MB 202 kB/s eta 0:00:01 Collecting tenacity>=6.2.0 Downloading tenacity-8.1.0-py3-none-any.whl (23 kB) WARNING: Ignoring invalid distribution -andas (/Users/bharat/opt/anaconda3/lib/python3.9/site-packages) Installing collected packages: tenacity, plotly WARNING: Ignoring invalid distribution -andas (/Users/bharat/opt/anaconda3/lib/python3.9/site-packages) WARNING: Ignoring invalid distribution -andas (/Users/bharat/opt/anaconda3/lib/python3.9/site-packages) Successfully installed plotly-5.11.0 tenacity-8.1.0 WARNING: Ignoring invalid distribution -andas (/Users/bharat/opt/anaconda3/lib/python3.9/site-packages) WARNING: Ignoring invalid distribution -andas (/Users/bharat/opt/anaconda3/lib/python3.9/site-packages) WARNING: Ignoring invalid distribution -andas (/Users/bharat/opt/anaconda3/lib/python3.9/site-packages)
CONTEXT: Company X manages the men's top professional basketball division of the American league system. The dataset contains information on all the teams that have participated in all the past tournaments. It has data about how many baskets each team scored, conceded, how many times they came within the first 2 positions, how many tournaments they have qualified, their best position in the past, etc.
OBJECTIVE: Company’s management wants to invest on proposal on managing some of the best teams in the league. The analytics department has been assigned with a task of creating a report on the performance shown by the teams. Some of the older teams are already in contract with competitors. Hence Company X wants to understand which teams they can approach which will be a deal win for them.
**Steps and tasks:
# import dataset
bb = pd.read_csv("basketball.csv")
# display samples
bb.head()
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931to32 | 1 |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 |
sns.pairplot(bb)
<seaborn.axisgrid.PairGrid at 0x7fc38f024cd0>
# Shape of the dataset
bb.shape
(61, 13)
# column names
bb.columns
Index(['Team', 'Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames',
'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion',
'Runner-up', 'TeamLaunch', 'HighestPositionHeld'],
dtype='object')
# quantitative data description
bb.describe()
| Tournament | HighestPositionHeld | |
|---|---|---|
| count | 61.000000 | 61.000000 |
| mean | 24.000000 | 7.081967 |
| std | 26.827225 | 5.276663 |
| min | 1.000000 | 1.000000 |
| 25% | 4.000000 | 3.000000 |
| 50% | 12.000000 | 6.000000 |
| 75% | 38.000000 | 10.000000 |
| max | 86.000000 | 20.000000 |
# datatype & missing data overview
bb.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 61 entries, 0 to 60 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Team 61 non-null object 1 Tournament 61 non-null int64 2 Score 61 non-null object 3 PlayedGames 61 non-null object 4 WonGames 61 non-null object 5 DrawnGames 61 non-null object 6 LostGames 61 non-null object 7 BasketScored 61 non-null object 8 BasketGiven 61 non-null object 9 TournamentChampion 61 non-null object 10 Runner-up 61 non-null object 11 TeamLaunch 61 non-null object 12 HighestPositionHeld 61 non-null int64 dtypes: int64(2), object(11) memory usage: 6.3+ KB
bb.columns[1:]
Index(['Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames',
'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion',
'Runner-up', 'TeamLaunch', 'HighestPositionHeld'],
dtype='object')
# convert elements to int
for i in bb.columns[1:]:
for j in bb.index:
try: # errors generated, hence lets exception handling
bb.loc[j,i]=int(bb.loc[j,i])
except:
print(i,j,bb.loc[j,i],sep=' : ',end='\n')
Score : 60 : - PlayedGames : 60 : - WonGames : 60 : - DrawnGames : 60 : - LostGames : 60 : - BasketScored : 60 : - BasketGiven : 60 : - TournamentChampion : 6 : - TournamentChampion : 8 : - TournamentChampion : 11 : - TournamentChampion : 12 : - TournamentChampion : 13 : - TournamentChampion : 14 : - TournamentChampion : 15 : - TournamentChampion : 16 : - TournamentChampion : 17 : - TournamentChampion : 18 : - TournamentChampion : 19 : - TournamentChampion : 20 : - TournamentChampion : 21 : - TournamentChampion : 22 : - TournamentChampion : 23 : - TournamentChampion : 24 : - TournamentChampion : 25 : - TournamentChampion : 26 : - TournamentChampion : 27 : - TournamentChampion : 28 : - TournamentChampion : 29 : - TournamentChampion : 30 : - TournamentChampion : 31 : - TournamentChampion : 32 : - TournamentChampion : 33 : - TournamentChampion : 34 : - TournamentChampion : 35 : - TournamentChampion : 36 : - TournamentChampion : 37 : - TournamentChampion : 38 : - TournamentChampion : 39 : - TournamentChampion : 40 : - TournamentChampion : 41 : - TournamentChampion : 42 : - TournamentChampion : 43 : - TournamentChampion : 44 : - TournamentChampion : 45 : - TournamentChampion : 46 : - TournamentChampion : 47 : - TournamentChampion : 48 : - TournamentChampion : 49 : - TournamentChampion : 50 : - TournamentChampion : 51 : - TournamentChampion : 52 : - TournamentChampion : 53 : - TournamentChampion : 54 : - TournamentChampion : 55 : - TournamentChampion : 56 : - TournamentChampion : 57 : - TournamentChampion : 58 : - TournamentChampion : 59 : - TournamentChampion : 60 : - Runner-up : 6 : - Runner-up : 9 : - Runner-up : 11 : - Runner-up : 12 : - Runner-up : 15 : - Runner-up : 16 : - Runner-up : 17 : - Runner-up : 18 : - Runner-up : 21 : - Runner-up : 22 : - Runner-up : 23 : - Runner-up : 24 : - Runner-up : 25 : - Runner-up : 26 : - Runner-up : 27 : - Runner-up : 28 : - Runner-up : 29 : - Runner-up : 30 : - Runner-up : 31 : - Runner-up : 32 : - Runner-up : 33 : - Runner-up : 34 : - Runner-up : 35 : - Runner-up : 36 : - Runner-up : 37 : - Runner-up : 38 : - Runner-up : 39 : - Runner-up : 40 : - Runner-up : 41 : - Runner-up : 42 : - Runner-up : 43 : - Runner-up : 44 : - Runner-up : 45 : - Runner-up : 46 : - Runner-up : 47 : - Runner-up : 48 : - Runner-up : 49 : - Runner-up : 50 : - Runner-up : 51 : - Runner-up : 52 : - Runner-up : 53 : - Runner-up : 54 : - Runner-up : 55 : - Runner-up : 56 : - Runner-up : 57 : - Runner-up : 58 : - Runner-up : 59 : - Runner-up : 60 : - TeamLaunch : 3 : 1931to32 TeamLaunch : 5 : 1934-35 TeamLaunch : 8 : 1939-40 TeamLaunch : 9 : 1932-33 TeamLaunch : 10 : 1941to42 TeamLaunch : 11 : 1939-40 TeamLaunch : 12 : 1948-49 TeamLaunch : 15 : 1935-36 TeamLaunch : 17 : 1933to34 TeamLaunch : 18 : 1960-61 TeamLaunch : 19 : 1951-52 TeamLaunch : 20 : 1998-99 TeamLaunch : 21 : 1941-42 TeamLaunch : 22 : 1977-78 TeamLaunch : 23 : 1959-60 TeamLaunch : 24 : 2004to05 TeamLaunch : 25 : 1935-36 TeamLaunch : 26 : 1961-62 TeamLaunch : 27 : 1940-41 TeamLaunch : 28 : 1930-31 TeamLaunch : 29 : 1963-64 TeamLaunch : 30 : 1974-75 TeamLaunch : 31 : 1943-44 TeamLaunch : 32 : 1977-78 TeamLaunch : 33 : 1987-88 TeamLaunch : 34 : 1941-42 TeamLaunch : 36 : 2007-08 TeamLaunch : 37 : 1962-63 TeamLaunch : 38 : 1994-95 TeamLaunch : 39 : 1978-79 TeamLaunch : 40 : 1971-72 TeamLaunch : 41 : 1963-64 TeamLaunch : 42 : 1999to00 TeamLaunch : 43 : 2014-15 TeamLaunch : 45 : 1990-91 TeamLaunch : 46 : 1947-48 TeamLaunch : 47 : 1996-97 TeamLaunch : 48 : 1995-96 TeamLaunch : 49 : 1945-46 TeamLaunch : 50 : 1953-54 TeamLaunch : 52 : 1979-80 TeamLaunch : 54 : 1950-51 TeamLaunch : 56 : 2009-10 TeamLaunch : 57 : 1956-57 TeamLaunch : 58 : 1951~52 TeamLaunch : 59 : 1955-56 TeamLaunch : 60 : 2017~18
Type casting creates error where non numerical values are present
# count invalid hyphens in each row
bb['hyphen']=np.zeros((61,1))
for i in bb.index:
try:
bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-']
except:
next
bb.hyphen.value_counts()
/var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-'] /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/1750570746.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb['hyphen'].iloc[i]=bb.iloc[i].value_counts()['-']
2.0 46 0.0 8 1.0 6 9.0 1 Name: hyphen, dtype: int64
bb.loc[bb.hyphen>2].index[0]
60
as seen above, there are several "-" in row index 60 lets drop that row as the team has never played games, leaving us with no clues about their performance.
# drop 60th row
bb=bb.drop(index=60)
bb.shape
(60, 14)
# Lets impute the remaining "-" in the "TournamentChampion" and "Runner-up" columns
# assign zeros to number of tournaments won or been runner up
bb["TournamentChampion"].loc[bb["TournamentChampion"]=="-"]=0
bb["Runner-up"].loc[bb["Runner-up"]=="-"]=0
bb=bb.drop(columns="hyphen")
/var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/42600856.py:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb["TournamentChampion"].loc[bb["TournamentChampion"]=="-"]=0 /var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/42600856.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy bb["Runner-up"].loc[bb["Runner-up"]=="-"]=0
# review the Team Launch column
bb.TeamLaunch.to_frame().T
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TeamLaunch | 1929 | 1929 | 1929 | 1931to32 | 1929 | 1934-35 | 1929 | 1929 | 1939-40 | 1932-33 | ... | 1953-54 | 1929 | 1979-80 | 1929 | 1950-51 | 201617 | 2009-10 | 1956-57 | 1951~52 | 1955-56 |
1 rows × 60 columns
The TeamLaunch column contains year of launch, either in Gregorian calendar year like YYYY or probably financila/academic year ranges like YYYY-YY Hence lets extract the Launch Year in Gregorian Format of YYYY alone
from datetime import date as dt
# Converting 'to' to '-'
for i in bb.index:
if type(bb.TeamLaunch.iloc[i]) is str:
bb.TeamLaunch.iloc[i]=bb.TeamLaunch.iloc[i].replace('to','-')
# extracting sets of 4 digits, delimiter '-' or '~' and 2 digits
tl=bb.TeamLaunch.str.extract(r'(\d\d\d\d)?([-~])?(\d\d)')
for i in bb.index:
j=bb.TeamLaunch.iloc[i]
if type(j) is str: # copying back above extracted years to TeamLaunch
bb.TeamLaunch.iloc[i]=int(tl.iloc[i,0])
elif type(j) is int:
if j>9999: # extracting year from YYYYYY formated numbers
bb.TeamLaunch.iloc[i]=int(np.divmod(j,100)[0])
/var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/2052726708.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
bb.TeamLaunch.iloc[i]=bb.TeamLaunch.iloc[i].replace('to','-')
/var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/2052726708.py:13: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
bb.TeamLaunch.iloc[i]=int(tl.iloc[i,0])
/var/folders/4b/2k_24hqn61s6p8w3pk7g57500000gn/T/ipykernel_4864/2052726708.py:16: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
bb.TeamLaunch.iloc[i]=int(np.divmod(j,100)[0])
# now lets convert all the colums (expect Team) in to integer type
for i in bb.columns[1:]:
try:
bb[i]=bb[i].astype(int)
except:
print("error in",i)
All columns converted without error
Rearrange dataset and convert Team name as the index and lets check the info & description
bb=bb[['Team', 'Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames',
'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion',
'Runner-up', 'TeamLaunch', 'HighestPositionHeld']]
bb=bb.set_index('Team')
bb.info()
<class 'pandas.core.frame.DataFrame'> Index: 60 entries, Team 1 to Team 60 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Tournament 60 non-null int64 1 Score 60 non-null int64 2 PlayedGames 60 non-null int64 3 WonGames 60 non-null int64 4 DrawnGames 60 non-null int64 5 LostGames 60 non-null int64 6 BasketScored 60 non-null int64 7 BasketGiven 60 non-null int64 8 TournamentChampion 60 non-null int64 9 Runner-up 60 non-null int64 10 TeamLaunch 60 non-null int64 11 HighestPositionHeld 60 non-null int64 dtypes: int64(12) memory usage: 6.1+ KB
Perform detailed statistical analysis and EDA using univariate, bi-variate and multivariate EDA techniques to get data driven insights on recommending which teams they can approach which will be a deal win for them. Also as a data and statistics expert you have to develop a detailed performance report using this data.
bb.describe()
| Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.00000 | 60.000000 | 60.000000 | 60.000000 |
| mean | 24.383333 | 916.450000 | 810.100000 | 309.033333 | 192.083333 | 308.816667 | 1159.350000 | 1159.233333 | 1.45000 | 1.433333 | 1957.950000 | 7.050000 |
| std | 26.884620 | 1138.342899 | 877.465393 | 408.481395 | 201.985508 | 294.508639 | 1512.063948 | 1163.946914 | 5.51554 | 4.574679 | 26.646732 | 5.315232 |
| min | 1.000000 | 14.000000 | 30.000000 | 5.000000 | 4.000000 | 15.000000 | 34.000000 | 55.000000 | 0.00000 | 0.000000 | 1929.000000 | 1.000000 |
| 25% | 4.000000 | 104.250000 | 115.500000 | 34.750000 | 26.250000 | 62.750000 | 154.500000 | 236.000000 | 0.00000 | 0.000000 | 1934.750000 | 3.000000 |
| 50% | 12.000000 | 395.500000 | 424.500000 | 124.000000 | 98.500000 | 197.500000 | 444.000000 | 632.500000 | 0.00000 | 0.000000 | 1950.500000 | 6.000000 |
| 75% | 39.000000 | 1360.500000 | 1345.500000 | 432.750000 | 331.500000 | 563.500000 | 1669.750000 | 2001.250000 | 0.00000 | 0.000000 | 1977.250000 | 10.000000 |
| max | 86.000000 | 4385.000000 | 2762.000000 | 1647.000000 | 633.000000 | 1070.000000 | 5947.000000 | 3889.000000 | 33.00000 | 25.000000 | 2016.000000 | 20.000000 |
The attributes of the data are varying in scales ranging from 10s to 1000s
From above statistics we are able to find significant outliers in 3 columns. Those outliers in Score, WonGames, BasketScored cannot be excluded from datapoints as those exceptional performances are defining the top teams
bb.hist(layout=(3,4),figsize=(15,10));
The attributes do not follow normal distribution, probably because of various generations of teams being compared here (TeamLaunch ranges over 60 years)
lets study attribute-wise distribution to get a better picture
from plotly.subplots import make_subplots
import plotly.graph_objects as go
fig = make_subplots(rows=2,cols=1)
fig.update_layout(
updatemenus=[
dict(
active=0,
buttons=list([
dict(label="All",method="update",
args=[{"visible":[True,True,True,True,True,True,True,True,True,True,True,True,
True,True,True,True,True,True,True,True,True,True,True,True]},
{"title":"All data points shown, select from list -->"
+
'+'"Inferences will be written alongside each plot",
"annotations":[]}]),
dict(label="Tournament",method="update",
args=[{"visible":[True,True,False,False,False,False,False,False,False,False,False,False,
False,False,False,False,False,False,False,False,False,False,False,False]},
{"title":"Tournament",
"annotations":[dict(text="Number of tournaments played is Right Skewed",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="24 of the teams have played less than 8 tournamnets",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="Score",method="update",
args=[{"visible":[False,False,True,True,False,False,False,False,False,False,False,False,
False,False,False,False,False,False,False,False,False,False,False,False]},
{"title":"Score",
"annotations":[dict(text="5 outstanding (outlier) teams with scores more than 2819",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="more than 50% of teams (32 teams) have scored less than 430",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="PlayedGames",method="update",
args=[{"visible":[False,False,False,False,True,True,False,False,False,False,False,False,
False,False,False,False,False,False,False,False,False,False,False,False]},
{"title":"PlayedGames",
"annotations":[dict(text="number of games played is Right Skewed (similar to tournaments)",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="24 of the teams have played less than 275 games",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="WonGames",method="update",
args=[{"visible":[False,False,False,False,False,False,True,True,False,False,False,False,
False,False,False,False,False,False,False,False,False,False,False,False]},
{"title":"WonGames",
"annotations":[dict(text="5 outstanding (outlier) teams have won more than 990 games",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="more than 50% of teams (35 teams) have won less than 165 games",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="DrawnGames",method="update",
args=[{"visible":[False,False,False,False,False,False,False,False,True,True,False,False,
False,False,False,False,False,False,False,False,False,False,False,False]},
{"title":"DrawnGames",
"annotations":[dict(text="number of games drawin is also Right Skewed",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="again, 23 of the teams have have drawn 61 games",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="LostGames",method="update",
args=[{"visible":[False,False,False,False,False,False,False,False,False,False,True,True,
False,False,False,False,False,False,False,False,False,False,False,False]},
{"title":"LostGames",
"annotations":[dict(text="number of games lost is also Right Skewed",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="22 of the teams have lost upto 105 games",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="BasketScored",method="update",
args=[{"visible":[False,False,False,False,False,False,False,False,False,False,False,False,
True,True,False,False,False,False,False,False,False,False,False,False]},
{"title":"BasketScored",
"annotations":[dict(text="5 outstanding (outlier) teams have scored more than 3680 baskets",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="more than 50% of teams (33 teams) have scored less than 600 baskets",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="BasketGiven",method="update",
args=[{"visible":[False,False,False,False,False,False,False,False,False,False,False,False,
False,False,True,True,False,False,False,False,False,False,False,False]},
{"title":"BasketGiven",
"annotations":[dict(text="The Right Skew shows that all teams have significantly given baskets, meaning high competition",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="23 of the teams have given less than 380 baskets",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="TournamentChampion",method="update",
args=[{"visible":[False,False,False,False,False,False,False,False,False,False,False,False,
False,False,False,False,True,True,False,False,False,False,False,False]},
{"title":"TournamentChampion",
"annotations":[dict(text="less than 10 teams had been champions",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="the fences merge due to the fact that major teams have never been champions",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="Runner-up",method="update",
args=[{"visible":[False,False,False,False,False,False,False,False,False,False,False,False,
False,False,False,False,False,False,True,True,False,False,False,False]},
{"title":"Runner-up",
"annotations":[dict(text="little more than 10 teams had been runners",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="the fences merge due to the fact that major teams have never been runners",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="TeamLaunch",method="update",
args=[{"visible":[False,False,False,False,False,False,False,False,False,False,False,False,
False,False,False,False,False,False,False,False,True,True,False,False]},
{"title":"TeamLaunch",
"annotations":[dict(text="wide aged teams are considered in the dataset, with significant number of teams formed much earlier",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="upto 11 teams formed before 1931, new team formation peaked around 1945 and again around 1995",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="HighestPositionHeld",method="update",
args=[{"visible":[False,False,False,False,False,False,False,False,False,False,False,False,
False,False,False,False,False,False,False,False,False,False,True,True]},
{"title":"HighestPositionHeld",
"annotations":[dict(text="a right skew means that, quite a large number of teams have secured better positions (lower positional value means better rank)",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="18 teams had secured top 3 positions at the highest",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
]),
direction="down",
showactive=True,
x=0.8,
xanchor="left",
y=1.2,
yanchor="top"
)
]
)
fig.update_layout(showlegend=False,title="All data points shown, select from list -->"
+
'+'"Inferences will be written alongside each plot")
for i in bb.columns:
fig.add_trace(go.Box(x=bb[i],name='',hovertemplate='%{x}',jitter=1),1,1)
fig.add_trace(go.Histogram(x=bb[i],name=i,xbins_size=(bb[i].max()-bb[i].min())/10),2,1)
fig.show()
With most of the attributes being right skewed and none following a normal distribution, it will be difficult to determine better performing teams
sns.pairplot(bb); # scatter plot between all variable combinations, depicting any possible relations
Quite a lot of attributes are found to be related either positively or inversely
let us review the correlation coefficient to measure the relationships
plt.figure(figsize=(14,10))
sns.heatmap(bb.corr(),annot=True,vmax=1,vmin=-1,cmap='RdYlGn');
bb["TeamAge"]=2021-bb.TeamLaunch
bb["Level"]=21-bb.HighestPositionHeld
bb["WinRatio"]=bb.WonGames.div(bb.PlayedGames)
bb["LoseRatio"]=bb.LostGames.div(bb.PlayedGames)
bb["DrawRatio"]=bb.DrawnGames.div(bb.PlayedGames)
bb["PlayRate"]=bb.PlayedGames.div(bb.TeamAge)
newbb=bb[['PlayedGames','WinRatio','DrawRatio','LoseRatio','BasketGiven','TournamentChampion','Runner-up',"TeamAge","Level","PlayRate"]]
newbb.head()
| PlayedGames | WinRatio | DrawRatio | LoseRatio | BasketGiven | TournamentChampion | Runner-up | TeamAge | Level | PlayRate | |
|---|---|---|---|---|---|---|---|---|---|---|
| Team | ||||||||||
| Team 1 | 2762 | 0.596307 | 0.199855 | 0.203838 | 3140 | 33 | 23 | 92 | 20 | 30.021739 |
| Team 2 | 2762 | 0.572411 | 0.207458 | 0.220130 | 3114 | 25 | 25 | 92 | 20 | 30.021739 |
| Team 3 | 2614 | 0.474751 | 0.228768 | 0.296480 | 3309 | 10 | 8 | 92 | 20 | 28.413043 |
| Team 4 | 2664 | 0.445571 | 0.231231 | 0.323198 | 3469 | 6 | 6 | 90 | 20 | 29.600000 |
| Team 5 | 2762 | 0.437726 | 0.229182 | 0.333092 | 3700 | 8 | 7 | 92 | 20 | 30.021739 |
# Lets us study the relationships with refined attributes of the set of teams
plt.figure(figsize=(14,10))
sns.heatmap(newbb.corr(),annot=True,vmax=1,vmin=-1,cmap='RdYlGn');
Now, the refined attributes define the group of teams more accurately
Having arrived at meaningful qualities of the group of teams one may intuitively choose teams with high WinRatio to invest on So lets see if that is a worthy of investment
from plotly import express as px
fig=px.scatter(hover_data=[newbb.index],data_frame=newbb,x='TeamAge',y='WinRatio',
size=newbb.Level,color='PlayRate',color_continuous_scale='Turbo')
fig.add_annotation(x=23, y=0.44,text="Team21",showarrow=False)
fig.add_annotation(x=17, y=0.35,text="Team25",showarrow=False)
Interestingly, Yes the teams with high WinRatios have been TournamentChampions for several times (Teams 1 to 5) But those are the oldest teams amongst the group, and are expected to have been contract with Competitors
So who are we left with? with only young teams!!! Surprisingly, among teams not older than 25 years, there are 2 budding performers with high perseverence Teams 21 & 25 has shown high interest to frequently play
# Compare teams of choice
# Against teams of age<=25
Compare2 = pd.DataFrame(columns=newbb.columns)
Compare2.loc["Mean",:]=newbb.loc[bb.TeamAge<=25].mean()
Compare2.loc["Std",:]=newbb.loc[bb.TeamAge<=25].std()
Compare2.loc["Team 21",:]=newbb.loc["Team 21",:]
Compare2.loc["Team 25",:]=newbb.loc["Team 25",:]
Compare2
| PlayedGames | WinRatio | DrawRatio | LoseRatio | BasketGiven | TournamentChampion | Runner-up | TeamAge | Level | PlayRate | |
|---|---|---|---|---|---|---|---|---|---|---|
| Mean | 219.0 | 0.278444 | 0.259735 | 0.461821 | 305.75 | 0.0 | 0.125 | 15.625 | 8.875 | 13.544709 |
| Std | 220.044151 | 0.067426 | 0.020349 | 0.064536 | 273.148181 | 0.0 | 0.353553 | 7.443837 | 6.490377 | 9.977962 |
| Team 21 | 646.0 | 0.411765 | 0.266254 | 0.321981 | 789.0 | 0.0 | 1.0 | 23.0 | 19.0 | 28.086957 |
| Team 25 | 456.0 | 0.322368 | 0.245614 | 0.432018 | 633.0 | 0.0 | 0.0 | 17.0 | 15.0 | 26.823529 |
CONTEXT: Company X is a EU online publisher focusing on the startups industry. The company specifically reports on the business related to technology news, analysis of emerging trends and profiling of new tech businesses and products. Their event i.e. Startup Battlefield is the world’s pre-eminent startup competition. Startup Battlefield features 15-30 top early stage startups pitching top judges in front of a vast live audience, present in person and online.
OBJECTIVE: Analyse the data of the various companies from the given dataset and perform the tasks that are specified in the below steps. Draw insights from the various attributes that are present in the dataset, plot distributions, state hypotheses and draw conclusions from the dataset.
#Read the CSV file
prod=pd.read_csv("Company.csv")
prod.head()
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating |
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed |
| 2 | 3DPrinterOS | 3dprinteros.com | NaN | Disrupt SF 2016 | Contestant | Operating |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating |
| 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating |
# shape & size of the dataset
prod.shape
(662, 6)
#Data Exploration
prod.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 662 entries, 0 to 661 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Startup 662 non-null object 1 Product 656 non-null object 2 Funding 448 non-null object 3 Event 662 non-null object 4 Result 662 non-null object 5 OperatingState 662 non-null object dtypes: object(6) memory usage: 31.2+ KB
All the attributes are found to be of object datatype Going forwards, Funding column must be considered for appropriate conversion
#Check for null values in the attributes
prod.isna().value_counts().to_frame()
| 0 | ||||||
|---|---|---|---|---|---|---|
| Startup | Product | Funding | Event | Result | OperatingState | |
| False | False | False | False | False | False | 446 |
| True | False | False | False | 210 | ||
| True | True | False | False | False | 4 | |
| False | False | False | False | 2 |
There are a total of 216 counts of nulls and Nans
#Dropping the null values
prod.dropna(axis=0,how='any',inplace=True)
prod.isna().value_counts().to_frame()
| 0 | ||||||
|---|---|---|---|---|---|---|
| Startup | Product | Funding | Event | Result | OperatingState | |
| False | False | False | False | False | False | 446 |
# Convert the ‘Funding’ features to a numerical value.
prod.Funding.head()
1 $630K 3 $1M 5 $19.3M 6 $29M 10 $1.8M Name: Funding, dtype: object
# check for variety of currency units
prod.Funding.apply(lambda x: x[0]).value_counts()
$ 446 Name: Funding, dtype: int64
# only Dollars found
# check for variety of multiplier units
prod.Funding.apply(lambda x: x[-1]).value_counts()
M 309 K 136 B 1 Name: Funding, dtype: int64
# three different multipliers found
# check for negative sign in 2nd position
prod.Funding.apply(lambda x: x[1]).value_counts()
1 149 2 81 3 60 5 35 6 33 4 30 7 25 8 19 9 14 Name: Funding, dtype: int64
import re
wildcard=r"""([$])(\d*\.*\d*)([KMB])"""
# extract components of the string
funds=prod.Funding.apply(lambda x:re.findall(wildcard,x))
# type casting to float & apply MILLION conversion
prod.Funding=funds.apply(lambda x:float(x[0][1])/1000 if x[0][2]=='K' else
float(x[0][1]) if x[0][2]=='M' else
float(x[0][1])*1000)
# review conversion to MILLION units
prod.Funding.head()
1 0.63 3 1.00 5 19.30 6 29.00 10 1.80 Name: Funding, dtype: float64
fig=go.Figure()
fig.add_trace(go.Box(x=prod.Funding,jitter=0,name='Funding',boxpoints='all'))
fig.add_annotation(x=400, y=-0.45,showarrow=False,
text="the data is heavily right skewed"'+'
'+'
"giving huge number of outliers")
# outlier summary
q1=prod["Funding"].quantile(q=0.25,interpolation='nearest') #quantiles
q3=prod["Funding"].quantile(q=0.75,interpolation='nearest')
IQR=q3-q1
lf=q1-(1.5*IQR)
lower_fence=np.sort(prod.loc[prod["Funding"]>=lf,"Funding"])[0] #whiskers
uf=q3+(1.5*IQR)
upper_fence=np.sort(prod.loc[prod["Funding"]<=uf,"Funding"])[-1]
print("Outlier count:",prod.loc[prod["Funding"]>upper_fence].shape[0]) #number of outliers
print("Total Records:",prod.shape[0])
Outlier count: 60 Total Records: 446
The above graph indicates heavy skewness in data, also depicting a whole lot of 60 records of Funding values as outliers.
But comparing with the sample size of just 446, the count of outliers is acounting to 13.45%
labeling more than 10% of available sample as outliers and excluding them from further analysis will greatly influence the sample data distribution hence let us try transforming the Funding data, to obtain better clarity on data
# Rescaling the funding info
prod["log(Fund)"]=pd.DataFrame(np.log(prod.Funding*1000)) #log scaling
# multiplied by 1000 to convert to K, so as to avoid negative log
fig=go.Figure()
fig.add_trace(go.Box(x=prod["log(Fund)"],jitter=1,name='log(Fund)',boxpoints='all'))
fig.add_annotation(x=7.5, y=0.5,showarrow=False,
text="BOX PLOT for FUNDS in log scale"
"log transform reduces the right skewedness")
# Get the lower fence from the box plot
# plotly doesn't feature any extracting methods
# hence lets arrive at the fence values mathematically
# also exactly matching with the plot data
q1=prod["log(Fund)"].quantile(q=0.25,interpolation='nearest')
q3=prod["log(Fund)"].quantile(q=0.75,interpolation='nearest')
IQR=q3-q1
lf=q1-(1.5*IQR)
lower_fence=np.sort(prod.loc[prod["log(Fund)"]>=lf,"log(Fund)"])[0]
uf=q3+(1.5*IQR)
upper_fence=np.sort(prod.loc[prod["log(Fund)"]<=uf,"log(Fund)"])[-1]
print("lower fence %.5f\n\
upper fence %.5f"%(lower_fence,upper_fence))
lower fence 2.88480 upper fence 12.71409
# Check number of outliers greater than upper fence
print("Number of datapoints above upper fence are %d"
%(prod.loc[prod["log(Fund)"]>upper_fence].shape[0]))
print("Number of datapoints below lower fence are %d"
%(prod.loc[prod["log(Fund)"]<lower_fence].shape[0]))
Number of datapoints above upper fence are 2 Number of datapoints below lower fence are 1
# Dropping the values that are greater than upper fence
# drop all the outliers, lower & upper outliers
# 1. visualise the outliers
display(prod.loc[prod["log(Fund)"]>upper_fence])
display(prod.loc[prod["log(Fund)"]<lower_fence])
| Startup | Product | Funding | Event | Result | OperatingState | log(Fund) | |
|---|---|---|---|---|---|---|---|
| 154 | Dropbox | dropbox.com | 1700.0 | TC50 2008 | Contestant | Operating | 14.346139 |
| 656 | Zenefits | zenefits.com | 583.6 | Disrupt NYC 2013 | Finalist | Operating | 13.276971 |
| Startup | Product | Funding | Event | Result | OperatingState | log(Fund) | |
|---|---|---|---|---|---|---|---|
| 198 | FoodStantly | foodstantly.com | 0.005 | Disrupt London 2016 | Contestant | Operating | 1.609438 |
# 2. drop the outliers
prod.drop(prod.loc[prod["log(Fund)"]>upper_fence].index, inplace=True)
prod.drop(prod.loc[prod["log(Fund)"]<lower_fence].index,inplace=True)
# Box plot after removing outliers
fig=go.Figure()
fig.add_trace(go.Box(x=prod["log(Fund)"],jitter=1,name='log(Fund)',boxpoints='all'))
fig.add_annotation(x=7.5, y=0.5,showarrow=False,
text="BOX PLOT for FUNDS in log scale"
"all datapoints are placed within the whiskers")
# Check frequency of the OperatingState features classes.
prod.OperatingState.value_counts()
Operating 316 Acquired 66 Closed 57 Ipo 4 Name: OperatingState, dtype: int64
fig=go.Figure()
fig.add_trace(go.Histogram(name='OperatingState',x=prod.OperatingState))
# Plot a distribution plot for Funds in million.
fig=make_subplots(2,1)
fig.add_trace(go.Histogram(name='Funds in Millions',x=prod.Funding),1,1)
fig.add_trace(go.Histogram(name='Funds in log scale',x=prod["log(Fund)"]),2,1)
fig.update_layout(showlegend=False)
fig.add_annotation(x=0,y=1.05,xref='paper',yref='paper',showarrow=False,
text="Funds in Millions")
fig.add_annotation(x=0,y=0.45,xref='paper',yref='paper',showarrow=False,
text="Funds in log scale")
# Plot distribution plots for companies still operating and companies that closed.
fig=make_subplots(2,2)
fig.add_trace(go.Histogram(name="Operating(Mil)",x=prod.loc[prod.OperatingState == 'Operating',"Funding"]),1,1)
fig.add_trace(go.Histogram(name="Closed(Mil)",x=prod.loc[prod.OperatingState == 'Closed',"Funding"]),1,2)
fig.add_trace(go.Histogram(name="Operating(log)",x=prod.loc[prod.OperatingState == 'Operating',"log(Fund)"]),2,1)
fig.add_trace(go.Histogram(name="Closed(log)",x=prod.loc[prod.OperatingState == 'Closed',"log(Fund)"]),2,2)
# lets describe the distribution
prod[["Funding","OperatingState"]].groupby(by="OperatingState").describe().T
| OperatingState | Acquired | Closed | Ipo | Operating | |
|---|---|---|---|---|---|
| Funding | count | 66.000000 | 57.000000 | 4.000000 | 316.000000 |
| mean | 13.212894 | 3.258170 | 137.775000 | 12.016558 | |
| std | 22.033230 | 5.923294 | 134.179988 | 32.077520 | |
| min | 0.020000 | 0.093000 | 34.900000 | 0.017900 | |
| 25% | 1.225000 | 0.475000 | 58.225000 | 0.745750 | |
| 50% | 5.800000 | 0.929200 | 91.900000 | 2.200000 | |
| 75% | 16.275000 | 3.500000 | 171.450000 | 8.125000 | |
| max | 142.000000 | 35.500000 | 332.400000 | 278.000000 |
The above description will suggest that the means and spread of Funding significantly varies between Operating & closed Companies. But, the influence of skewness could raise an ambiguity over the inference.
Hence lets review the same in log transformed data
desc=prod[["log(Fund)","OperatingState"]].groupby(by="OperatingState").describe().T
desc
| OperatingState | Acquired | Closed | Ipo | Operating | |
|---|---|---|---|---|---|
| log(Fund) | count | 66.000000 | 57.000000 | 4.000000 | 316.000000 |
| mean | 8.212889 | 7.062826 | 11.487123 | 7.731426 | |
| std | 2.030676 | 1.427570 | 0.957040 | 1.928969 | |
| min | 2.995732 | 4.532599 | 10.460242 | 2.884801 | |
| 25% | 7.110088 | 6.163315 | 10.938118 | 6.614390 | |
| 50% | 8.661884 | 6.834324 | 11.387077 | 7.696213 | |
| 75% | 9.697093 | 8.160518 | 11.936081 | 9.002687 | |
| max | 11.863582 | 10.477288 | 12.714094 | 12.535376 |
print("mean of Funds to Operating Companies: $%.2f millions\n\
with standard deviation: $%.3f millions\n"
%(np.e**desc.Operating["log(Fund)","mean"]/1000,
np.e**desc.Operating["log(Fund)","std"]/1000))
print("mean of Funds to Closed Companies: $%.2f millions\n\
with standard deviation: $%.3f millions"
%(np.e**desc.Closed["log(Fund)","mean"]/1000,
np.e**desc.Closed["log(Fund)","std"]/1000))
mean of Funds to Operating Companies: $2.28 millions with standard deviation: $0.007 millions mean of Funds to Closed Companies: $1.17 millions with standard deviation: $0.004 millions
The previous inference is supported by log transformed data also Funds allocated to Closed companies are far less than those allocated to successfully operating companies
Let us also verify the same using a 2 sample T test
Null hypothesis Ho
Alternate Hypothesis Ha
from scipy.stats import ttest_ind
s1=prod.loc[prod.OperatingState=="Operating","Funding"]
s2=prod.loc[prod.OperatingState=="Closed","Funding"]
alpha=0.05 # critical
t_res=ttest_ind(s1,s2)
if t_res.pvalue<=alpha:
print("The pvalue %.3f being within alpha %.2f,\n\
the samples are significantly different,\n\
thus we reject the Null Hypothesis Ho"%(t_res.pvalue,alpha))
else:
print("The pvalue %.3f being above alpha %.2f,\n\
the samples are not significantly different,\n\
thus we fail to reject the Null Hypothesis Ho"%(t_res.pvalue,alpha))
The pvalue 0.041 being within alpha 0.05, the samples are significantly different, thus we reject the Null Hypothesis Ho
Conclusion: The above test reiterates that the funds allocated are not similar
# Make a copy of the original data frame.
df=pd.read_csv("Company.csv")
df.head()
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating |
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed |
| 2 | 3DPrinterOS | 3dprinteros.com | NaN | Disrupt SF 2016 | Contestant | Operating |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating |
| 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating |
# Check frequency distribution of Result variable.
df.Result.value_counts()
Contestant 488 Finalist 84 Audience choice 41 Winner 26 Runner up 23 Name: Result, dtype: int64
fig=go.Figure()
fig.add_trace(go.Histogram(name='Result',x=df.Result))
Calculate percentage of winners that are still operating and percentage of contestants that are still operating
Considering all recognised companies as winners, for the sake of analysis
conts=df.Result.value_counts().loc['Contestant']
wins=df.shape[0]-conts
op_conts=df.loc[df.Result=='Contestant',"OperatingState"].value_counts().loc["Operating"]
op_wins=df.loc[df.Result!='Contestant',"OperatingState"].value_counts().loc["Operating"]
print("Percentage of winners still operating: %%%.2f"%(op_wins/wins*100))
print("Percentage of contestants still operating: %%%.2f"%(op_conts/conts*100))
Percentage of winners still operating: %76.44 Percentage of contestants still operating: %68.03
Z test of proportions
Null Hypothesis Ho : Proportion of Winner Companies & Contestant Companies are similar
Alternate Hypothesis Ha : Proportion of Winner Companies & Contestant Companies are significantly different
from statsmodels.stats.proportion import proportions_ztest
stat,p=proportions_ztest(count=[op_conts,op_wins],nobs=[conts,wins])
alpha=0.05 # critical
if p<=alpha:
print("The pvalue %.3f being within alpha %.2f,\n\
the samples are significantly different,\n\
thus we reject the Null Hypothesis Ho"%(p,alpha))
else:
print("The pvalue %.3f being above alpha %.2f,\n\
the samples are not significantly different,\n\
thus we fail to reject the Null Hypothesis Ho"%(p,alpha))
The pvalue 0.037 being within alpha 0.05, the samples are significantly different, thus we reject the Null Hypothesis Ho
Conclusion: compannies recognised in the Startup Battlefield event have survived better than the remaining contestants
# Check distribution of the Event variable.
df.Event.value_counts().to_frame()
| Event | |
|---|---|
| TC50 2008 | 52 |
| TC50 2009 | 50 |
| TC40 2007 | 40 |
| Disrupt NYC 2011 | 32 |
| Disrupt SF 2013 | 31 |
| Disrupt SF 2011 | 31 |
| Disrupt NYC 2013 | 31 |
| Disrupt SF 2012 | 30 |
| Disrupt NYC 2012 | 30 |
| Disrupt SF 2014 | 28 |
| Disrupt SF 2015 | 27 |
| Disrupt NYC 2014 | 27 |
| Disrupt SF 2010 | 26 |
| Disrupt SF 2016 | 26 |
| Disrupt NY 2015 | 26 |
| Disrupt NY 2016 | 22 |
| Disrupt NYC 2010 | 22 |
| Disrupt Beijing 2011 | 18 |
| Disrupt EU 2013 | 15 |
| Hardware Battlefield 2015 | 15 |
| Disrupt EU 2014 | 15 |
| Disrupt London 2015 | 15 |
| Hardware Battlefield 2014 | 14 |
| Disrupt London 2016 | 13 |
| - | 13 |
| Hardware Battlefield 2016 | 13 |
fig=go.Figure()
fig.add_trace(go.Histogram(name='Event',x=df.Event))
TC50 2008 & 2009 has seen maximum number of contestants
# Select only the Event that has disrupt keyword from 2013 onwards.
flag1=df.Event.loc[df.Event!="-"].apply(lambda x:re.findall("Disrupt",x)==['Disrupt'])
flag2=df.Event.loc[df.Event!="-"].apply(lambda x:int(x[-4:])>=2013)
df.Event.loc[df.Event!="-"][flag1&flag2].value_counts()
Disrupt SF 2013 31 Disrupt NYC 2013 31 Disrupt SF 2014 28 Disrupt SF 2015 27 Disrupt NYC 2014 27 Disrupt SF 2016 26 Disrupt NY 2015 26 Disrupt NY 2016 22 Disrupt EU 2013 15 Disrupt EU 2014 15 Disrupt London 2015 15 Disrupt London 2016 13 Name: Event, dtype: int64
fig=go.Figure()
fig.add_trace(go.Histogram(name='Event',x=df.Event.loc[df.Event!="-"][flag1&flag2]))
fig=go.Figure()
fig.add_trace(go.Histogram(name='Event',x=df.Event.loc[df.Event!="-"][flag1&flag2]))